IEEE Robotics and Automation Letters
Gaussian Variational Inference with Non-Gaussian Factors for State Estimation: A UWB Localization Case Study
Stirling, Andrew, Lukashchuk, Mykola, Bagaev, Dmitry, Kouw, Wouter, Forbes, James R.
This letter extends the exactly sparse Gaussian variational inference (ESGVI) algorithm for state estimation in two complementary directions. First, ESGVI is generalized to operate on matrix Lie groups, enabling the estimation of states with orientation components while respecting the underlying group structure. Second, factors are introduced to accommodate heavy-tailed and skewed noise distributions, as commonly encountered in ultra-wideband (UWB) localization due to non-line-of-sight (NLOS) and multipath effects. Both extensions are shown to integrate naturally within the ESGVI framework while preserving its sparse and derivative-free structure. The proposed approach is validated in a UWB localization experiment with NLOS-rich measurements, demonstrating improved accuracy and comparable consistency. Finally, a Python implementation within a factor-graph-based estimation framework is made open-source (https://github.com/decargroup/gvi_ws) to support broader research use.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
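As a rough illustration of why a heavy-tailed factor helps with NLOS-style outliers, the sketch below compares the negative log-likelihood of a Gaussian range residual with that of a Student's t residual. The abstract does not name the specific heavy-tailed or skewed distributions used, so the Student's t choice, the function names, and the parameter values are all assumptions made for illustration; this is not the ESGVI implementation.

```python
import math


def gaussian_nll(residual, sigma):
    """Negative log-likelihood of a zero-mean Gaussian residual."""
    return (0.5 * (residual / sigma) ** 2
            + math.log(sigma)
            + 0.5 * math.log(2.0 * math.pi))


def student_t_nll(residual, sigma, nu):
    """Negative log-likelihood of a zero-mean, scaled Student's t residual.

    nu is the degrees of freedom; small nu gives heavier tails.
    (Student's t is an assumed stand-in for a heavy-tailed factor.)
    """
    z2 = (residual / sigma) ** 2
    log_norm = (math.lgamma((nu + 1.0) / 2.0)
                - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi)
                - math.log(sigma))
    return -log_norm + 0.5 * (nu + 1.0) * math.log1p(z2 / nu)


if __name__ == "__main__":
    # Illustrative values only: a small inlier residual and a large outlier.
    for r in (0.5, 10.0):
        print(r, gaussian_nll(r, 1.0), student_t_nll(r, 1.0, 3.0))
```

The Gaussian cost grows quadratically with the residual while the Student's t cost grows only logarithmically, so a single large outlier pulls the estimate much less under the heavy-tailed model.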
Super4DR: 4D Radar-centric Self-supervised Odometry and Gaussian-based Map Optimization
Li, Zhiheng, Wang, Weihua, Shen, Qiang, Zhao, Yichen, Fang, Zheng
Conventional SLAM systems using visual or LiDAR data often struggle in poor lighting and severe weather. Although 4D radar is suited for such environments, its sparse and noisy point clouds hinder accurate odometry estimation, while the radar maps suffer from obscure and incomplete structures. Thus, we propose Super4DR, a 4D radar-centric framework for learning-based odometry estimation and Gaussian-based map optimization. First, we design a cluster-aware odometry network that incorporates object-level cues from the clustered radar points for inter-frame matching, alongside a hierarchical self-supervision mechanism to overcome outliers through spatio-temporal consistency, knowledge transfer, and feature contrast. Second, we propose using 3D Gaussians as an intermediate representation, coupled with a radar-specific growth strategy, selective separation, and multi-view regularization, to recover blurry map areas and those undetected based on image texture. Experiments show that Super4DR achieves a 67% performance gain over prior self-supervised methods, nearly matches supervised odometry, and narrows the map quality disparity with LiDAR while enabling multi-modal image rendering.
ShelfAware: Real-Time Visual-Inertial Semantic Localization in Quasi-Static Environments with Low-Cost Sensors
Agrawal, Shivendra, Brawer, Jake, Naik, Ashutosh, Roncone, Alessandro, Hayes, Bradley
Many indoor workspaces are quasi-static: global layout is stable but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside MCL, yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. Across 100 global-localization trials spanning four conditions (cart-mounted, wearable, dynamic obstacles, and sparse semantics) in a semantically dense, retail environment, ShelfAware achieves a 96% success rate (vs. 22% MCL and 10% AMCL) with a mean time-to-convergence of 1.91s, attains the lowest translational RMSE in all conditions, and maintains stable tracking in 80% of tested sequences, all while running in real time on a consumer laptop-class platform. By modeling semantics distributionally at the category level and leveraging inverse proposals, ShelfAware resolves geometric aliasing and semantic drift common to quasi-static domains. Because the method requires only vision sensors and VIO, it integrates as an infrastructure-free building block for mobile robots in warehouses, labs, and retail settings; as a representative application, it also supports the creation of assistive devices providing start-anytime, shared-control assistive navigation for people with visual impairments.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.90)
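The ShelfAware abstract describes a particle filter whose weights combine a depth likelihood with a semantic similarity score. A minimal sketch of one such weight update follows. How the two terms are actually fused is not stated in the abstract, so the product form, the `alpha` exponent, and all function names here are assumptions for illustration only.

```python
def fuse_weights(prior_weights, depth_lik, sem_sim, alpha=1.0):
    """Update particle weights from two per-particle scores.

    Assumed fusion rule: w_i is proportional to
    prior_i * depth_lik_i * sem_sim_i ** alpha.
    Falls back to uniform weights if every particle scores zero.
    """
    raw = [w * d * (s ** alpha)
           for w, d, s in zip(prior_weights, depth_lik, sem_sim)]
    total = sum(raw)
    if total <= 0.0:
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def effective_sample_size(weights):
    """Standard particle-filter degeneracy measure, 1 / sum(w_i^2)."""
    return 1.0 / sum(w * w for w in weights)


if __name__ == "__main__":
    # Four hypothetical particles: two match the depth scan well, and
    # semantics disambiguates between those two.
    prior = [0.25, 0.25, 0.25, 0.25]
    depth = [0.9, 0.9, 0.1, 0.1]
    semantic = [0.8, 0.2, 0.8, 0.2]
    w = fuse_weights(prior, depth, semantic)
    print(w, effective_sample_size(w))
```

In this toy case the first two particles are indistinguishable by depth alone, which is the geometric aliasing situation the abstract mentions; the semantic term is what separates them.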
Flow-Aided Flight Through Dynamic Clutters From Point To Motion
Xu, Bowen, Yan, Zexuan, Lu, Minghao, Fan, Xiyu, Luo, Yi, Lin, Youshen, Chen, Zhiqiang, Chen, Yeke, Qiao, Qiyuan, Lu, Peng
Challenges in traversing dynamic clutters lie mainly in efficiently perceiving environmental dynamics and generating evasive behaviors that account for obstacle movement. Previous solutions have made progress by explicitly modeling dynamic obstacle motion for avoidance, but this key dependency of decision-making is time-consuming and unreliable in highly dynamic scenarios with occlusions. In contrast, without introducing object detection, tracking, or prediction, we empower reinforcement learning (RL) with single-LiDAR sensing to realize an autonomous flight system that goes directly from point to motion. For exteroception, a depth-sensing distance map that is fixed-shape, low-resolution, and detail-safe is encoded from raw point clouds, and an environment-change-sensing point flow is adopted as the motion feature extracted from multi-frame observations. These two are integrated into a lightweight and easy-to-learn representation of complex dynamic environments. For action generation, the behavior of avoiding dynamic threats in advance is implicitly driven by the proposed change-aware sensing representation, where the policy optimization is guided by the relative-motion-modulated distance field. With deployment-friendly sensing simulation and dynamics-model-free acceleration control, the proposed system shows a superior success rate and adaptability compared with alternatives, and the policy derived from the simulator can drive a real-world quadrotor with safe maneuvers.
Spatiotemporal Calibration and Ground Truth Estimation for High-Precision SLAM Benchmarking in Extended Reality
Shu, Zichao, Bei, Shitao, Li, Lijun, Chen, Zetao
Simultaneous localization and mapping (SLAM) plays a fundamental role in extended reality (XR) applications. As the standards for immersion in XR continue to increase, the demands for SLAM benchmarking have become more stringent. Trajectory accuracy is the key metric, and marker-based optical motion capture (MoCap) systems are widely used to generate ground truth (GT) because of their drift-free and relatively accurate measurements. However, the precision of MoCap-based GT is limited by two factors: the spatiotemporal calibration with the device under test (DUT) and the inherent jitter in the MoCap measurements. These limitations hinder accurate SLAM benchmarking, particularly for key metrics like rotation error and inter-frame jitter, which are critical for immersive XR experiences. This paper presents a novel continuous-time maximum likelihood estimator to address these challenges. The proposed method integrates auxiliary inertial measurement unit (IMU) data to compensate for MoCap jitter. Additionally, a variable time synchronization method and a pose residual based on screw congruence constraints are proposed, enabling precise spatiotemporal calibration across multiple sensors and the DUT. Experimental results demonstrate that our approach outperforms existing methods, achieving the precision necessary for comprehensive benchmarking of state-of-the-art SLAM algorithms in XR applications. Furthermore, we thoroughly validate the practicality of our method by benchmarking several leading XR devices and open-source SLAM algorithms. The code is publicly available at https://github.com/ylab-xrpg/xr-hpgt.
- Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.69)
- Information Technology > Artificial Intelligence > Vision > Video Understanding (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)
PPL: Point Cloud Supervised Proprioceptive Locomotion Reinforcement Learning for Legged Robots in Crawl Spaces
Ma, Bida, Xu, Nuo, Qi, Chenkun, Liu, Xin, Mo, Yule, Wang, Jinkai, Lu, Chunpeng
Legged locomotion in constrained spaces (called crawl spaces) is challenging. In crawl spaces, current proprioceptive locomotion learning methods struggle to achieve traversal because only ground features are inferred. In this study, a point cloud supervised RL framework for proprioceptive locomotion in crawl spaces is proposed. A state estimation network is designed to estimate the robot's collision states as well as ground and spatial features for locomotion. A point cloud feature extraction method is proposed to supervise the state estimation network. The method uses a representation of the point cloud in a polar coordinate frame and MLPs for efficient feature extraction. Experiments demonstrate that, compared with existing methods, our method exhibits faster iteration time in training and more agile locomotion in crawl spaces. This study enhances the ability of legged robots to traverse constrained spaces without requiring exteroceptive sensors. In recent years, legged robots have demonstrated remarkable terrain traversal capabilities, exhibiting significant application value.
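To make the idea of a polar-frame point cloud representation concrete, the sketch below bins points by azimuth around the robot and keeps the closest range in each sector, giving a fixed-length vector that an MLP could consume. The abstract does not specify the binning scheme, so the azimuth-only sectors, the neglect of height, the `max_range` fill value, and the function name are all assumptions for illustration.

```python
import math


def polar_min_range(points, num_sectors=8, max_range=5.0):
    """Return the minimum horizontal range per azimuth sector.

    points: iterable of (x, y, z) in the robot frame; z is ignored in
    this simplified sketch. Empty sectors are filled with max_range.
    """
    bins = [max_range] * num_sectors
    width = 2.0 * math.pi / num_sectors
    for x, y, _z in points:
        rng = math.hypot(x, y)
        # Wrap the azimuth into [0, 2*pi) before binning.
        theta = math.atan2(y, x) % (2.0 * math.pi)
        idx = min(int(theta / width), num_sectors - 1)
        bins[idx] = min(bins[idx], rng)
    return bins


if __name__ == "__main__":
    cloud = [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.0),
             (-2.0, -2.0, 0.0), (0.5, 0.5, 0.1)]
    print(polar_min_range(cloud, num_sectors=4))
```

A fixed number of sectors keeps the feature size constant regardless of how many points the sensor returns, which is one plausible reason such a representation pairs well with small MLPs.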
CSMapping: Scalable Crowdsourced Semantic Mapping and Topology Inference for Autonomous Driving
Qiao, Zhijian, Yu, Zehuan, Li, Tong, Chou, Chih-Chung, Ding, Wenchao, Shen, Shaojie
Crowdsourcing enables scalable autonomous driving map construction, but low-cost sensor noise hinders quality from improving with data volume. We propose CSMapping, a system that produces accurate semantic maps and topological road centerlines whose quality consistently increases with more crowdsourced data. For semantic mapping, we train a latent diffusion model on HD maps (optionally conditioned on SD maps) to learn a generative prior of real-world map structure, without requiring paired crowdsourced/HD-map supervision. This prior is incorporated via constrained MAP optimization in latent space, ensuring robustness to severe noise and plausible completion in unobserved areas. Initialization uses a robust vectorized mapping module followed by diffusion inversion; optimization employs efficient Gaussian-basis reparameterization, projected gradient descent with multi-start, and a latent-space factor graph for global consistency. For topological mapping, we apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories, yielding smooth, human-like centerlines robust to trajectory variation. Experiments on nuScenes, Argoverse 2, and a large proprietary dataset achieve state-of-the-art semantic and topological mapping performance, with thorough ablation and scalability studies.
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks (0.84)
- Information Technology > Robotics & Automation (0.70)
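The CSMapping abstract mentions confidence-weighted k-medoids clustering of trajectories. A minimal sketch of weighted k-medoids is given below, using 2D points and Euclidean distance as a stand-in for trajectories and a trajectory distance. The weighting scheme (each member's confidence scales its contribution to a candidate medoid's cost), the naive initialization, and the function name are assumptions for illustration, not the authors' method.

```python
import math


def weighted_k_medoids(points, weights, k, iters=20):
    """Cluster points around k medoids chosen from the points themselves.

    Each candidate medoid m in a cluster is scored by
    sum_i weights[i] * dist(points[i], points[m]) over cluster members,
    so high-confidence members pull the medoid toward themselves.
    Returns (medoid_indices, assignment).
    """
    medoids = list(range(k))  # naive init: the first k points (assumption)
    assign = []
    for _ in range(iters):
        # Assignment step: nearest medoid for every point.
        assign = [min(range(k),
                      key=lambda c: math.dist(p, points[medoids[c]]))
                  for p in points]
        # Update step: cheapest weighted medoid within each cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            best = min(members,
                       key=lambda m: sum(weights[i]
                                         * math.dist(points[i], points[m])
                                         for i in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
           (5.0, 0.0), (5.1, 0.0), (5.3, 0.0)]
    print(weighted_k_medoids(pts, [1.0] * 6, k=2))
    # Raising the confidence of the last point shifts its cluster's medoid.
    print(weighted_k_medoids(pts, [1, 1, 1, 1, 1, 10], k=2))
```

Because a medoid is always an actual member of the data, the cluster representative stays a real observation rather than an average, which is one reason medoids are a natural choice when the items being clustered are whole trajectories.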
A Cross-Embodiment Gripper Benchmark for Rigid-Object Manipulation in Aerial and Industrial Robotics
Vagas, Marek, Varga, Martin, Romancik, Jaroslav, Majercak, Ondrej, Suarez, Alejandro, Ollero, Anibal, Vanderborght, Bram, Virgala, Ivan
Robotic grippers are increasingly deployed across industrial, collaborative, and aerial platforms, where each embodiment imposes distinct mechanical, energetic, and operational constraints. Established YCB and NIST benchmarks quantify grasp success, force, or timing on a single platform, but do not evaluate cross-embodiment transferability or energy-aware performance, capabilities essential for modern mobile and aerial manipulation. This letter introduces the Cross-Embodiment Gripper Benchmark (CEGB), a compact and reproducible benchmarking suite extending YCB and selected NIST metrics with three additional components: a transfer-time benchmark measuring the practical effort required to exchange embodiments, an energy-consumption benchmark evaluating grasping and holding efficiency, and an intent-specific ideal payload assessment reflecting design-dependent operational capability. Together, these metrics characterize both grasp performance and the suitability of reusing a single gripper across heterogeneous robotic systems. A lightweight self-locking gripper prototype is implemented as a reference case. Experiments demonstrate rapid embodiment transfer (median 17.6 s across user groups), low holding energy for the gripper prototype (1.5 J per 10 s), and consistent grasp performance with cycle times of 3.2-3.9 s. CEGB thus provides a reproducible foundation for cross-platform, energy-aware evaluation of grippers in aerial and manipulator domains. Robotic grasping has been extensively investigated across industrial, collaborative, and aerial domains.
MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction
Keskar, Maitrayee, Trivedi, Mohan, Greer, Ross
We present a method for trajectory planning for autonomous driving, learning image-based context embeddings that align with motion prediction frameworks and planning-based intention input. Within our method, a ViT encoder takes raw images and past kinematic state as input and is trained to produce context embeddings, drawing inspiration from those generated by the recent MTR (Motion Transformer) encoder, effectively substituting map-based features with learned visual representations. MTR provides a strong foundation for multimodal trajectory prediction by localizing agent intent and refining motion iteratively via motion query pairs; we name our approach MTR-VP (Motion Transformer for Vision-based Planning), and instead of the learnable intention queries used in the MTR decoder, we use cross attention on the intent and the context embeddings, which reflect a combination of information encoded from the driving scene and past vehicle states. We evaluate our methods on the Waymo End-to-End Driving Dataset, which requires predicting the agent's future 5-second trajectory in bird's-eye-view coordinates using prior camera images, agent pose history, and routing goals. We analyze our architecture using ablation studies, removing input images and multiple trajectory output. Our results suggest that transformer-based methods that are used to combine the visual features along with the kinetic features such as the past trajectory features are not effective at combining both modes to produce useful scene context embeddings, even when intention embeddings are augmented with foundation-model representations of scene context from CLIP and DINOv2, but that predicting a distribution over multiple futures instead of a single future trajectory boosts planning performance.
- Transportation > Ground > Road (0.51)
- Information Technology (0.37)
- Automobiles & Trucks (0.37)
Quality-guided UAV Surface Exploration for 3D Reconstruction
Sportich, Benjamin, Boubakri, Kenza, Simonin, Olivier, Renzaglia, Alessandro
Reasons for mapping an unknown environment with autonomous robots are wide-ranging, but in practice, they are often overlooked when developing planning strategies. Rapid information gathering and comprehensive structural assessment of buildings have different requirements and therefore necessitate distinct methodologies. In this paper, we propose a novel modular Next-Best-View (NBV) planning framework for aerial robots that explicitly uses a reconstruction quality objective to guide the exploration planning. In particular, our approach introduces new and efficient methods for view generation and selection of viewpoint candidates that are adaptive to the user-defined quality requirements, fully exploiting the uncertainty encoded in a Truncated Signed Distance Field (TSDF) representation of the environment. This results in informed and efficient exploration decisions tailored towards the predetermined objective. We demonstrate that it successfully adjusts its behavior to the user goal while consistently outperforming conventional NBV strategies in terms of coverage, quality of the final 3D map and path efficiency. Autonomous exploration for 3D reconstruction is a fundamental task in robotics, with critical real-world applications such as infrastructure inspection, mapping, environmental monitoring, and search-and-rescue missions [1]. In these scenarios, Unmanned Aerial Vehicles (UAVs) equipped with onboard visual and range sensors have proven essential for efficiently navigating complex environments and capturing aerial perspectives that enable comprehensive 3D reconstructions or a swift survey of an area.